Diabetes on PIMA Dataset

In this project we will apply a Fully Bayesian Logistic Regression model aimed at predicting the onset of diabetes mellitus in Pima Indians given diagnostic measurements.
Diabetes is a metabolic disorders characterized by a high blood sugar level (hyperglycemia) over a prolonged period of time. It is mainly due to the reduction of insulin production or lack of capability by the cells to properly respond to the insulin produced.

The dataset used is Pima Indians Diabetes Dataset, publicly available on Kaggle : https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database, containing medical information of female patients belonging to Pima Indian heritage of age \(\geq\) 21.

It consist of several medical predictor variables:

For each patient, the binary target variable Outcome indicates whether or not she is affected by Diabetes.

Exploratory Data Analysis

Daling with missing data

First, let’s look for missing/inconsistent values in our data.

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
Min. 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0780 21.0000
1st Qu. 1.0000 99.0000 62.0000 0.0000 0.0000 27.3000 0.2438 24.0000
Median 3.0000 117.0000 72.0000 23.0000 30.5000 32.0000 0.3725 29.0000
Mean 3.8451 120.8945 69.1055 20.5365 79.7995 31.9926 0.4719 33.2409
3rd Qu. 6.0000 140.2500 80.0000 32.0000 127.2500 36.6000 0.6262 41.0000
Max. 17.0000 199.0000 122.0000 99.0000 846.0000 67.1000 2.4200 81.0000
Missing Values 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Zeros 111.0000 5.0000 35.0000 227.0000 374.0000 11.0000 0.0000 0.0000

  

There are no missing data, but several meaningless zero values, that we can consider as NaN (e.g., BloodPressure = 0). So, we replaced the zeros with the per-class median in the following fields:

  • Glucose
  • BloodPressure
  • BMI

For what concerns the Skin Thickness and Insulin, the amount of missing values is respectively 0.296 and 0.296. Hence, such replacement would be meaningless. Therefore, we removed those features.

Feature Distribution

Here it is represented the Box Plot for all the features

Univariate Analysis

In the following, they are reported some plots representing the distribution of the different features. Please notice that the values reported refer to intra-class percentages, i.e., the counts are normalized by dividing for the carnality of the class they refer to. By doing so, we are able to compare the different categories regardless from the class unbalance of the data set.

Pregnancies

In this dataset, diabetes seems more common among women that had experienced many pregnancies. Although gestation per se cannot cause diabetes (except for gestational diabetes, which is a specific temporary condition), an high number of pregnancies can lead to an hormonal imbalance or weight gain, which may increase the risk of developing diabetes.

Age

In this dataset, diabetes seems more common across people above age of 35; this is possibly because that Type I diabetes can arise at any age, while Type II diabetes is more common across people of age \(\geq 40\).

Glucose

Not surprisingly, diabetic people have an higher glycemic index, being this is one of the hillness’ symptoms.

Blood Pressure

Diabetic people seem to suffer more from hypertension than non-diabetic patients. This may be due to the fact that some forms of diabetes (Type II) are associated with high weight, which on turn may cause high blood pressure.

Diabetes Pedigree Function

Body Mass Index

Diabetes is more common among obese people; indeed obesity is one of the risk factor for developing Type II diabetes.

Correlation and Mulitcollinearity Check

The features in our dataset are weakly correlated, hence in the regression, there won’t be Multicollinearity problems.

Multivariate analyisis

Age and number of Pregancies are weakly Diabeticly correlated (\(\rho \approx 0.5\) ). This is not surprising since elder women are more likely to have had an higher number of Pregnancies during their life.

Class distribution

In order to overcome the severe class imalance of our data, we applied SMOTE: a data augmentation strategy consisting in oversampling the minority class (Diabetic in our case).

Eventually we obtained a dataset with the same statistics as above, but more balanced!

The Model

We will use Bayesian Logistic Regression to model the relationship between the binary target variable \(Y \in \{0,1\}\) (diabetic = 1, healthy = 0) with the input features \(x_j\ \ ,\ \ \ j=1,...,p\):

feature name variable name
Pregnancies x1
Glucose x2
BloodPressure x3
BMI x4
DiabetesPedigreeFunction x5
Age x6
Outcome Y

Moreover, in the following we will consider the canonical notation: \(x_0=1\) in order to model the intercept. Differently from Bayesian linear regression, there is no variance term to be estimated, and only the regression parameters (\(\beta_j\)) will be estimated.

For each patient, the target variable can be modeled as a Bernoulli: \[ Y|\boldsymbol{x} \sim Ber(\theta_{\boldsymbol{x}})\] Hence: \[ \Pr(Y = 1|\boldsymbol{x}) = \mathbb{E}[Y = 1|\boldsymbol{x}] = \theta_\boldsymbol{x} \] In logistic regression, the Link function that links \(\mathbb{E}[Y = 1|\boldsymbol{x}]\) with the linear predictor \(\boldsymbol{\beta \cdot x} = \sum_{j} \beta_j \cdot x_j\) is the Logit Function:

\[ \theta_{\boldsymbol{x}} = \text{ilogit}\left(\boldsymbol{\beta \cdot x} \right) \iff \boldsymbol{\beta \cdot x} = \text{logit} \left(\theta_{x}\right) \]

where: \[ \text{ilogit}\left(z\right) = \frac{\exp(z)}{1+\exp(z)} \] is the sigmoid function, forcing \(\boldsymbol{\beta \cdot x_i}\) in the \([0,1]\) range;

and \[\text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)\] Is the logit function (inverse of the sigmoid) computing the logarithm of the odds.

Likelihood

The overall dataset is randomly shuffled and sliced in Train and Test sets according to a 80/20 split.

  • The Training examples represent the observations in our Bayesian Analysis.
  • The Test set will be used to measure the performances of our model and compare them with the ones achieved by other classification ML techniques.

Let’s say we have \(n\) observations; we can define their likelihood function as follows: \[ L_{\boldsymbol{y}}(\boldsymbol{\beta}, \boldsymbol{X}) = f(\boldsymbol{y}|\boldsymbol{\beta},\boldsymbol{X}) = \prod_{i= 1}^n f(y^{(i)}|\boldsymbol{\beta,x^{(i)}}) = \prod_{i= 1}^n \text{ilogit}( \boldsymbol{\beta \cdot x^{(i)}}) ^{y^{(i)}}\left(1-\text{ilogit}( \boldsymbol{\beta \cdot x^{(i)}})\right)^{1-y^{(i)}} \]

Joint prior of the parameters

The joint prior distribution of the parameters on \(\mathbf{B}^p = \{(-\infty, +\infty)\}^{p}\) can be computed as the product of their marginal distributions:

\[ \pi(\boldsymbol{\beta}) = \prod_{j= 1}^{p} \pi(\beta_j) = \pi(\beta_0)\cdot \pi(\beta_1)\cdot ...\cdot \pi(\beta_p) \]

In our case we will consider two kind of priors, expressing our belief that each \(\beta_j\) represents the true population characteristics (under our modellistic assumptions), prior to the observation of any data :

  • Normal distribution:

\[ \boldsymbol{\beta} \sim N_p(\boldsymbol{\mu}, \mathbb{I}_p\sigma^2) \iff\beta_j \overset{iid}{\sim} N\left(\mu_j, \sigma_j^2= 10^4\right) \ \ \ j = 1,..., p \]

  • Laplace distribution:

\[ \beta_j \overset{iid}{\sim} \text{Laplace}\left(\mu_j,b_j = \frac{10^2}{\sqrt2}\right) \ \ \ j = 1,..., p \] For what conernes the hyperparameters, we consider both Normal and Laplace distribution centered in 0 (\(\mu_j = 0 \ , \ \ j = 1,..., p\)) Since we have weak prior knowledge, a diffuse prior distribution is used, spreading more or less evenly the probability distribution over \(\beta_j\), which is modeled in terms of assigning an high variance to the prior:

  • Gaussian: \(\sigma^2_j = 10^4 \ , \ \ j = 1,..., p\) .
  • Laplace: \(b_j = \frac{10^2}{2} \iff \mathbb{V}\text{ar}(\beta_j)= 10^4 \ , \ \ j = 1,..., p\) .

We will then compare the Deviance Information Criterion of the two models to select the most fitting one.

The Posterior

In Bayesian inference we combine our prior beliefs about the model parameters - expressed in terms of prior distribution - with the likelihood of data. The Posterior distribution express our belief that \(\beta_j\) is the true value, having observed the data. The engine that allow us to obtain the posterior distribution from the prior distribution is the Bayes Rule: \[ \text{Posterior} \propto \text{Prior}\times \text{Likelihood} \]

\[ \pi(\boldsymbol{\beta}|\boldsymbol{y,X}) \propto L_{\boldsymbol{Y}}(\boldsymbol{\beta,X})\times \pi(\boldsymbol{\beta}) = \\ \text{[Gaussian marginal priors]}\\ =\prod_{i= 1}^n \left(\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{y^{(i)}} \left(1- \frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{1-y^{(i)}} \prod_{j=0}^{p} {\frac {1}{\sigma_{j} {\sqrt {2\pi }}}}\exp \left(-{\frac {1}{2}}{\frac {(\beta_j-\mu_j )^{2}}{\sigma_{j} ^{2}}}\right) \\ \text{[Laplace marginal priors]}\\ =\prod_{i= 1}^n \left(\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{y^{(i)}} \left(1- \frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{1-y^{(i)}} \prod_{j=0}^{p} {\frac {1}{b_{j}}}\exp \left(-{\frac {1}{2}}{\frac {|\beta_j-\mu_j |}{\sigma_{j} ^{2}}}\right) \]

Full Conditionals

Now, let’s derive the full conditionals In the Gaussian Prior case, this is equal to: \[ \pi(\beta_j|\boldsymbol{y,X,\beta_{(-j)}}) \propto \pi(\boldsymbol{y}|\boldsymbol{X,\beta}) \pi(\beta_j) = \prod_{i= 1}^n \left(\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{y^{(i)}} \left(1- \frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}}) }\right)^{1-y^{(i)}} \cdot {\frac {1}{\sigma_{j} {\sqrt {2\pi }}}}\exp \left(-{\frac {1}{2}}{\frac {(\beta_j-\mu_j )^{2}}{\sigma_{j} ^{2}}}\right) \]

In the Laplace Prior case, this is equal to: \[ \pi(\beta_j|\boldsymbol{y,X,\beta_{(-j)}}) \propto \pi(\boldsymbol{y}|\boldsymbol{X, \beta}) \pi(\beta_j)= \prod_{i=1}^{n} \left(\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp( \boldsymbol{\beta \cdot x^{(i)}})}\right)^{y^{(i)}}\left(1-\frac{\exp( \boldsymbol{\beta \cdot x^{(i)}})}{1+\exp(\boldsymbol{\beta \cdot x^{(i)}}) }\right)^{1-y^{(i)}} \cdot{\frac {1}{2b_{j}}}\exp \left(-{\frac {|\beta_j-\mu_{j} |}{b_{j}}}\right) \]

Both distributions do not belong to any known parametric families; hence, it is not possible to directly draw i.i.d. simulations from them in order to provide estimates for \(\boldsymbol{\beta}\); we will hence rely on Markov Chain simulations to obtain estimates for the unknown parameters.

Metropolis-within-Gibbs algorithm

Gibbs Sampling

Gibbs sampling is an algorithm aimed to simulate a suitable Markov Chain:

  1. START : We start at time \(t=0\) with a fixed initial point \(\beta_j^{0}=x^{(0)}=(x_1,…,x_k)\)
  2. ITERATE: For \(j=1,..,p\):

\[ \beta_{j}^{(t+1)} = \pi\left(\beta_{j}|\boldsymbol{\beta_{(-j)}}\right) = \pi\left(\beta_{j} |\beta_0^{(t+1)}, \beta_{1}^{(t+1)}, ..., \beta_{j-1}^{(t+1)}, \beta_{j+1}^{(t)},...,\beta_{p}^{(t)} \right) \]

This per se does not provide a strictly stationary Markov Chain, but:

  • If we start \(X_0\sim \pi(\cdot)\), then the Markov Chain is strictly stationary
  • If we wait for a suitable time, then the Markov Chain is approximately stationary

We can look at each update - which is temporary within the Gibbs cycle - as a transition from one state to another.

\[ \beta_j^{(t+1)} = \pi \left(\beta_j |\boldsymbol{\beta_{(-j)}}\right) \rightarrow K_j \]

Metropolis Hastings

Metropolis-Hastings consists in building a Markov Chain by iteratively draw a candidate \(Y_{t+1}=y\) from a proposal distribution \(p_x(y)\) - which depends on the current state of the chain \(X_{t}=x\) - and decide weather accept it as next state of the chain or remain in the current state \(x\). Hence, the next state of the chain: \[ X_{t+1} = \begin{cases} y \ \ \text{ with probability }\ \ \ \ \ \ \ \ \ \alpha(x,y)\\ x \ \ \text{ with probability } \ 1- \alpha(x,y) \end{cases} \] where: \[ \alpha(x,y) = \min\left\{1,\frac{\pi(y)}{\pi(x)}\frac{p_y(x)}{p_x(y)}\right\} \]

Metropolis-within-Gibbs sampling

Gibbs sampling is a special case of Metropolis algorithm, in which the new proposal is accepted with probability 1. It requires to recognize the full conditionals, which is not our case since any of the \(\beta_j\) belong to a well-known parametric family. In this case we can apply a Metropolis-within-Gibbs algorithm, which consists in replacing the Kernel \(K^{GS}\) from which we are not able to directly simulate in the Gibbs Sampling with the kernel of the Metropolis \(\tilde{K}^{MH}\), having as target density the full conditional of \(\beta_j\).

Jags Models

Normal

model 
{
  for (i in 1:N){
    Y[i] ~ dbern(p[i])
    p[i] <- ilogit(beta0+beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta5*x5[i] + beta6*x6[i])
    
  }
  
  # Defining the prior beta parameters
  beta0 ~ ddexp(0, 1.0E-4)
  beta1 ~ ddexp(0, 1.0E-4)
  beta2 ~ ddexp(0, 1.0E-4)
  beta3 ~ ddexp(0, 1.0E-4)
  beta4 ~ ddexp(0, 1.0E-4)
  beta5 ~ ddexp(0, 1.0E-4)
  beta6 ~ ddexp(0, 1.0E-4)

}
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta0 -8.6472 0.7593 -10.1110 -9.1532 -8.6436 -8.1331 -7.2100 1.0005 3000
beta1 0.1168 0.0316 0.0581 0.0951 0.1164 0.1389 0.1787 1.0005 3000
beta2 0.0354 0.0034 0.0289 0.0330 0.0354 0.0376 0.0423 1.0008 3000
beta3 -0.0062 0.0081 -0.0221 -0.0117 -0.0063 -0.0007 0.0101 1.0008 3000
beta4 0.1027 0.0150 0.0738 0.0924 0.1024 0.1128 0.1325 1.0007 3000
beta5 0.8973 0.2855 0.3435 0.7077 0.8944 1.0909 1.4522 1.0019 1400
beta6 0.0130 0.0094 -0.0048 0.0067 0.0128 0.0193 0.0314 1.0007 3000
deviance 849.4309 3.9556 844.1442 846.6033 848.7129 851.5262 858.6929 1.0007 3000
DIC pD
857.2577 7.8269

Laplace

#  MODEL SPECIFICATION 

model 
{
  for (i in 1:N){
    Y[i] ~ dbern(p[i])
    p[i] <- ilogit(beta0+beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta5*x5[i] + beta6*x6[i])
    
  }
  
  # Defining the prior beta parameters
  beta0 ~ ddexp(0, 1.0E-4)
  beta1 ~ ddexp(0, 1.0E-4)
  beta2 ~ ddexp(0, 1.0E-4)
  beta3 ~ ddexp(0, 1.0E-4)
  beta4 ~ ddexp(0, 1.0E-4)
  beta5 ~ ddexp(0, 1.0E-4)
  beta6 ~ ddexp(0, 1.0E-4)

}
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta0 -8.6299 0.7467 -10.0184 -9.1809 -8.6562 -8.0994 -7.2196 1.0119 960
beta1 0.1165 0.0317 0.0552 0.0957 0.1162 0.1370 0.1792 1.0012 3000
beta2 0.0350 0.0033 0.0288 0.0327 0.0350 0.0371 0.0416 1.0049 470
beta3 -0.0061 0.0081 -0.0226 -0.0113 -0.0061 -0.0004 0.0092 1.0058 480
beta4 0.1032 0.0152 0.0732 0.0931 0.1034 0.1133 0.1325 1.0111 590
beta5 0.8920 0.2866 0.3391 0.7014 0.8790 1.0800 1.4671 1.0013 2600
beta6 0.0133 0.0093 -0.0044 0.0070 0.0132 0.0192 0.0320 1.0031 3000
deviance 849.3543 3.7506 844.0283 846.6301 848.7282 851.3682 858.7077 1.0055 400
DIC pD
856.3575 7.0032

The 95%-CI for \(\beta_3\) in both models contains the zero, hence the parameter does not matter in our analysis in the relation between the covariates and the outputs. Let’s have a look to the model if we remove the variable \(x_3\) = Blood Pressure:

Normal Removing Null features

mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta0 -8.8640 0.6967 -10.2138 -9.3367 -8.8709 -8.3993 -7.4989 1.0008 3000
beta1 0.1158 0.0314 0.0548 0.0951 0.1150 0.1367 0.1780 1.0008 3000
beta2 0.0348 0.0034 0.0283 0.0325 0.0348 0.0371 0.0417 1.0006 3000
beta4 0.0995 0.0142 0.0718 0.0898 0.0999 0.1092 0.1275 1.0019 1400
beta5 0.9061 0.2868 0.3602 0.7104 0.9000 1.0992 1.4678 1.0023 1100
beta6 0.0113 0.0090 -0.0060 0.0051 0.0112 0.0175 0.0289 1.0011 3000
deviance 848.8892 3.4988 844.1271 846.4120 848.1873 850.6570 856.9820 1.0036 830
DIC pD
854.9993 6.11

Laplace Removing Null features

mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta0 -8.8867 0.7055 -10.3427 -9.3491 -8.8694 -8.4005 -7.5533 1.0098 220
beta1 0.1151 0.0321 0.0497 0.0936 0.1151 0.1369 0.1783 1.0023 1100
beta2 0.0348 0.0034 0.0280 0.0326 0.0349 0.0371 0.0414 1.0042 590
beta4 0.0998 0.0153 0.0700 0.0894 0.0998 0.1104 0.1299 1.0107 220
beta5 0.9119 0.2882 0.3601 0.7118 0.9126 1.0993 1.4859 1.0017 3000
beta6 0.0116 0.0093 -0.0061 0.0054 0.0114 0.0179 0.0297 1.0058 380
deviance 849.1369 3.6672 844.1308 846.4201 848.5044 851.0922 858.1590 1.0026 2000
DIC pD
855.8589 6.722

The criterion used for selecting the best model is the Deviance Information Criterion (DIC): a comparative index used for choosing among competing models. It is a measure of goodness of fit of the model (average deviance), discounted by a penalization term,representing the number of parameters (pD):

\[ DIC = p_D+ \hat{D}_{\text{avg}}(\theta) \] where:

\[\hat{D}_{\text{avg}}(\theta) \approx \frac{1}{M} \sum_{i=1}^M -2 \log\left(f\left(y|\theta^{(j)}\right) \right) \]

The model with the smallest DIC assumes Normal Priors and excludes the variable x3; hence, the following analysis will be referred to such setup.

Normal Model (without x3)

  • Traceplot: The traceplots allow to assess the convergence of the markov chain by showing the time series of the chain. All chains seem to be exploring the same region of parameter values, which is a good sign. The expected outcome is to observe a stationary behavior after a number of transitions, which translates into regular oscillation within a certain range due to the chain reaching the target distribution.

  • Density plot: Depicts the simulated sampling distribution of the parameters, for each chain. The yellow vertical line represents the empirical mean, while the horizontal one the 95% quantile based credible interval based on the joint results of all of the three chains.

  • Autocorrelation Function Plot : The Auto Correlation Function(ACF) plots visualize how much the correlation between the simulated values holds in the previous states.

    • On the \(x\)-axis we have the Lag \(h\)
    • On the \(y\)-axis it’s reported \(\rho(h)\), estimate of the correlation among coordinates in the stochastic process which is thought to be stationary: \(\mathbb{C}\text{orr}(\beta_t,\beta_{(t+h)} )\)

If the process becomes stationary, we expect the correlation to vanish as we increase the lag \(h\). In the limit, if \(\rho=0\) the simulations are iid (which of course is not our case).

Running Mean

One of the main goals of performing MCMC is to approximate quantities which are functionals of the target distribution \(\pi(\cdot)\), such as the empirical mean: \[ I = \mathbb{E}_{\pi}[\beta_{j}] \approx \hat{I_t}= \sum_{i=1}^t\frac{\beta_{i}}{t} \]

Where we assume the sample \(\beta_{j}^{(1)}, ..., \beta_{j}^{(t)}\) to be simulated from suitable Markov Chain: invariant w.r.t. \(\pi(\cdot)\). Such condition is satisfied either (i) by considering as starting distribution the stationary distribution or (ii) - under the regularity conditions make the ergodic behavior (SLL) possible - by taking only the samples following the burn-in time \(T_0\), a suitable time such that \(\beta_{j}^{(T_0-1)} \approx \pi(\cdot)\).

By looking the plots we can see that the line that quickly approaches the overall mean. The point in which this happens is precisely the burn-in time.

Posterior uncertainty

As criterion to establish the parameter with the larges the posterior uncertainty, we can look at the width of the quantile-based credible intervals (95%):

Quantile Based
LB UB width
beta0 -10.3427 -7.5533 2.7894
beta1 0.0497 0.1783 0.1285
beta2 0.0280 0.0414 0.0135
beta4 0.0700 0.1299 0.0599
beta5 0.3601 1.4859 1.1258
beta6 -0.0061 0.0297 0.0359
Highest Posterior Density
LB UB width
beta0 -10.3184 -7.5380 2.7804
beta1 0.0483 0.1760 0.1276
beta2 0.0280 0.0414 0.0134
beta4 0.0714 0.1309 0.0595
beta5 0.3445 1.4715 1.1270
beta6 -0.0063 0.0295 0.0358

As we can notice, in both cases the parameter associated to the widest credible interval is the intercept: \(\beta_0\).

Correlation between parameters

The most correlated couple of parameters is \(\beta_4\) and \(\beta_0\).

Errors

The empirical mean is an unbiased estimator; therefore the MSE is equal to its variance. Differently from the Vanilla Monte Carlo case, in which the estimate are computed on iid samples, in the MCMC each realization of the sample depends on the previous one, which on turns depend on its prior and so on.. As a consequence, the variance of the empirical mean of the simulated values cannot be computed as the simulated values divided by the number of simulations, but we must take into account the covariance between each pair of simulated values. The higher the correlation, the larger will be the variance of the approximation provided by the MCMC algorithm with respect to the variance of the iid simulations proper of the MC method. This implies that the Markov Chain is more inefficient than standard iid simulation. This can be quantified by the effective sample size (ESS):

\[ t_{eff}=\frac{t}{\left(1+2\sum_{k=1}^{+\infty} \rho_k \right)} \]

Used to normalize the variance of \(h(\beta_j)\) and get the variance of \(\hat{I}_t\):

\[ \sigma^2_{\hat{I}_t} = \mathbb{V}\text{ar}[\hat{I}_t] =\frac{\mathbb{V}\text{ar}_{\pi}(h(\beta_j))}{t_{eff}} = \frac{1}{\left(1+2\sum_{k=1}^{+\infty} \rho_k \right)} \cdot \frac{\sigma^2}{t} \]

Another estimate of the inaccuracy of the MC samples is the Monte Carlo Standard Error (MCSE). It measures the standard deviation around the posterior mean of the samples due to the uncertainty of the MCMC algorithm.

ESS IFiid MCMCerror MCSEerror
beta0 1000 0.0005 0.0005 0.0216
beta1 1000 0.0000 0.0000 0.0010
beta2 1000 0.0000 0.0000 0.0001
beta4 1000 0.0000 0.0000 0.0004
beta5 1000 0.0001 0.0001 0.0091
beta6 1000 0.0000 0.0000 0.0003
deviance 1000 0.0122 0.0122 0.1178

Model Diagnostics

In the following they are reported some Convergence Diagnostics, aimed at verifying the presence of converge issues, which might suggest to enlarge the number of simulations or use some other types of parametrizations.

Gelman-Rubin Diagnostic

Gelman-Rubin diagnostic evaluates the convergence by analyzing the difference between multiple Markov chains. Let:

  • \(M\) : Number of chians
  • \(T\): Number of iterations for each chain.

For each parameter, we compare the between-chains and within-chain estimated variances; large differences between them indicate nonconvergence.

  • Between-chain variance: \[ B_T = \frac{1}{M} \sum_{m=1}^M (\hat{I}^{(m)} - \hat{I}_T)^2 \]
  • Within-chain variance:

\[ W_T = \frac{1}{M} \sum_{m=1}^3 \left[ \frac{1}{T} \sum_{t=1}^T \left(h\left(\beta_t^{(m)}\right) -\hat{I}_T^{(m)}\right ) \right] \]

where:

  • \(\hat{I}_T^{(m)} = \frac{1}{T} \sum_{t=1}^{T} h\left(\beta_t^{(m)}\right)\)
  • \(\hat{I_T}\) is the overall estimator using the values coming from all of the \(M\) chains

Based on these measures, we compute the Potential Scale Reduction factor: \[ \hat{R_T} = \sqrt{ \frac{T-1}{T} W_T + \frac{B_T}{n}} \]

The model is healthy if \(\hat{R}_T \approx 1\) and is below some threshold (usually \(\hat{R}_T < 1.1\)).

Gelman-Rubin diagnostic
Point Estimate Upper CI
beta0 1.000 1.001
beta1 1.001 1.005
beta2 1.000 1.002
beta4 1.001 1.006
beta5 1.001 1.004
beta6 1.001 1.004
deviance 1.003 1.009
Multivariate PSRF
1.005

Geweke Diagnostics

Geweke Diagnostics consists in an inter chain convergence check. Compares the first 20% of the chain after burnin with the half last 50% percent; If the samples are drawn from the stationary distribution of the chain, the two means are equal and Geweke’s statistic has an asymptotically standard normal distribution. The test statistic is a standard Z-score, calculated under the assumption that the two parts of the chain are asymptotically independent: \[ Z = \frac{\hat{\beta}_A-\hat{\beta}_B}{\frac{1}{T_A}\hat{S}^{A}_{\beta}(0)+\frac{1}{T_B}\hat{S}^{B}_{\beta}(0)} \] where \(A\),\(B\) are two windows within the Markov chain and the standard error is estimated from the spectral density at zero and so takes into account any autocorrelation. If \(|Z|<1.96\) we accept the null hypothesis.

Z-scores for Geweke diagnostic
Chain 1 Chain 2 Chain 3
beta0 0.509 0.378 -0.066
beta1 0.154 -0.097 -1.011
beta2 -1.221 -0.114 -0.821
beta4 0.163 -0.651 0.417
beta5 0.020 0.396 0.742
beta6 0.642 -0.546 0.846
deviance -0.019 -0.547 -0.022

Heidelberg & Welch diagnostics

Uses a test statistic to verify the null hypothesis that the values sampled from the Markov Chian come from a stationary distribution. It composes of two parts:

  1. Stationary Test:

    1. Define an \(\alpha\) level.
    2. Test he null hypothesis that the chain is from a stationary distribution by calculating the Cramer-von-Mises statistic on the whole chain.
    3. If the null hypothesis is rejected, then discard the first 10% of the chain.
    4. Repeat the test, each time discarding the next 10% in case of rejection, until either the null hypothesis is accepted or 50% of the chain is discarded. If the test still rejects the null hypothesis, then the chain fails the test and needs to be run longer.
  2. Halfwidth Test: If the chain passes the first part of the diagnostic, then the part of the chain that was not discarded from the first part is used to test the second part.

Chain 1
Stationarity test start iteration p-value Halfwidth test Mean Halfwidth
beta0 pass 1 0.711 pass -8.864 0.043
beta1 pass 1 0.773 pass 0.117 0.002
beta2 pass 1 0.462 pass 0.035 0.000
beta4 pass 1 0.828 pass 0.100 0.001
beta5 pass 1 0.434 pass 0.914 0.018
beta6 pass 1 0.288 pass 0.011 0.001
deviance pass 1 0.160 pass 848.991 0.217
Chain 2
Stationarity test start iteration p-value Halfwidth test Mean Halfwidth
beta0 pass 1 0.986 pass -8.880 0.040
beta1 pass 1 0.880 pass 0.115 0.002
beta2 pass 1 0.809 pass 0.035 0.000
beta4 pass 1 0.806 pass 0.100 0.001
beta5 pass 1 0.386 pass 0.889 0.014
beta6 pass 1 0.786 pass 0.011 0.001
deviance pass 1 0.689 pass 848.647 0.217
Chain 3
Stationarity test start iteration p-value Halfwidth test Mean Halfwidth
beta0 pass 1 0.947 pass -8.848 0.044
beta1 pass 1 0.230 pass 0.116 0.002
beta2 pass 1 0.302 pass 0.035 0.000
beta4 pass 1 0.718 pass 0.099 0.001
beta5 pass 1 0.632 pass 0.915 0.018
beta6 pass 1 0.506 pass 0.011 0.001
deviance pass 1 0.965 pass 849.030 0.228

Validation

In order to validate our model, we can create, through simulation, a synthetic dataset from a distribution whose parameters (\(\beta\)’s) are fixed and thus, known. Then, we check if it is able to recover the true population parameters by applying it to the simulated data.

simulation parameters fixed parameters ratio
beta0 -8.787 -8.798 1.001
beta1 0.083 0.114 1.375
beta2 0.035 0.035 0.990
beta4 0.109 0.099 0.909
beta5 0.873 0.896 1.026
beta6 0.006 0.011 1.783

As we can notice, the approximation ratio is \(\approx 1\) for almost all the features. We can conclude that the model is able to recover the true parameter and confirm its adequacy.

Evaluation

We want to test our model by evaluating its predictive performances over new observations (test data);

Compare the metrics

Here they are reported the results of our Bayesian Logistic Regression Classifier,with the canonical cut-off level = 0.5, compared to the ones obtained by using different Machine Learning models, namely:

  • Frequentistic Logistic Regression
  • Random Forest
  • Support Vector Classifier
Bayesian_LR Frequentistic_LR RandomForest SVM
Accuracy 0.7884615 0.7740385 0.8317308 0.7788462
Precision 0.8155340 0.9029126 0.8476190 0.7961165
Recall 0.7706422 0.7153846 0.8240741 0.7663551
F1-score 0.7924528 0.7982833 0.8356808 0.7809524

As we can notice, Random Forest is the best performing model; nevertheless, our Bayesian Logistic Regression has an accuracy higher than both Frequentistic Logistic Regression and Support Vector Classifier, despite a lower Precision than the former. An important remark is that we are mainly interested in achieving an high recall, since we want to detect as many Diabetic people as possible to allow them to preventive care; our model overtakes both Frequentistic LR and SVM.

Confusion Matrix

ROC CURVE

ROC curve for different values of cutoffs \(p\) for classifying observations:

The area under the ROC curve is: 0.882

Bibliography

Mohammed, Ahmed & Hassan, Masoud & Kadir, Dler. (2020). Improving Classification Performance for a Novel Imbalanced Medical Dataset using SMOTE Method. International Journal of Advanced Trends in Computer Science and Engineering. 9. 3161-3172. 10.30534/ijatcse/2020/104932020.

Hassan, Masoud. (2020). A Fully Bayesian Logistic Regression Model for Classification of ZADA Diabetes Dataset. Science Journal of University of Zakho. 8. 105-111. 10.25271/sjuoz.2020.8.3.707.